Evaluate Business Location Using Pedestrian Traffic: Day and Night
Authored by: Barkha Javed, Weiran Kong
Duration: 75 mins
Level: Intermediate
Pre-requisite skills: Python
Scenario
As a business owner, I want to know how much pedestrian foot traffic occurs around me during the day and night, so that I can evaluate the suitability of the location and hours for my business.
Busy foot traffic in a business area during the day may not mean busy foot traffic at night.
As a business owner, I want to know how much pedestrian foot traffic I get compared to surrounding areas, so that I can assess whether it is better to adapt my hours, extend them, or move location.
The flow of foot traffic during the day or night may indicate that staffing levels should be adapted for specific hours.
Significantly lower foot traffic than on surrounding streets may mean adapting the business strategy or moving location.
Steady foot traffic from early morning to mid-afternoon only may indicate adapting business hours to match.
What this Use Case will teach you
At the end of this use case you will understand how to:
A brief introduction to the datasets used
Each dataset and its intended use is described briefly below.
An exploratory data analysis of day and night traffic, showing the initial findings, is available; please refer to compare pedestrian traffic day night.
Accessing and Loading data
#Libraries to be installed
#The -q flag reduces pip's output
!pip -q install sodapy
!pip -q install seaborn
!pip -q install pandas
!pip -q install matplotlib
!pip -q install numpy
!pip -q install nbconvert
!pip -q install keyboard
!pip -q install geopandas
!pip -q install scikit-learn
!pip -q install folium
#load libraries
import os
import time
import keyboard
from datetime import datetime
import numpy as np
import pandas as pd
from sodapy import Socrata
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import style
style.use('ggplot')
import plotly.graph_objs as go
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
import folium
from folium.plugins import MarkerCluster
#Socrata client connection
client = Socrata('data.melbourne.vic.gov.au', '9UAGAvkem9fqXDhBLElapjazL', None)
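As an aside, each `client.get` call returns at most `limit` rows, so very large tables can also be fetched page by page with `offset`. A hedged sketch of that pattern (the `fetch_all` helper and page size are illustrative names, not part of this notebook):

```python
#Hedged sketch: page through a Socrata dataset using limit/offset.
#fetch_all and page_size are illustrative, not from the original notebook.
def fetch_all(client, dataset_id, page_size=50000):
    rows, offset = [], 0
    while True:
        page = client.get(dataset_id, limit=page_size, offset=offset)
        if not page:  #an empty page means the dataset is exhausted
            break
        rows.extend(page)
        offset += page_size
    return rows
```

This avoids guessing an upper bound such as `limit=7000000` for the full download.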
#Pedestrian sensor location data
sensor_data_id = "h57g-5234"
results = client.get(sensor_data_id)
sensor_location = pd.DataFrame.from_records(results)
#sensor_location.head(5)
sensor_location[['latitude', 'longitude']] = sensor_location[['latitude', 'longitude']].astype(float)
sensor_location = sensor_location.drop('location', axis=1)
#copy the coordinates (already floats) to the short names used by the mapping code later
sensor_location['lat'] = sensor_location['latitude']
sensor_location['lon'] = sensor_location['longitude']
#View sensor data
sensor_location.info()
sensor_location.head(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   sensor_id           91 non-null     object
 1   sensor_description  91 non-null     object
 2   sensor_name         91 non-null     object
 3   installation_date   91 non-null     object
 4   status              91 non-null     object
 5   note                10 non-null     object
 6   latitude            91 non-null     float64
 7   longitude           91 non-null     float64
 8   direction_1         78 non-null     object
 9   direction_2         78 non-null     object
 10  lat                 91 non-null     float64
 11  lon                 91 non-null     float64
dtypes: float64(4), object(8)
memory usage: 8.7+ KB
| | sensor_id | sensor_description | sensor_name | installation_date | status | note | latitude | longitude | direction_1 | direction_2 | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | Australia on Collins | Col270_T | 2009-03-30T00:00:00.000 | R | Device moved to location ID 53 (22/09/2015) | -37.815734 | 144.965210 | NaN | NaN | -37.815734 | 144.965210 |
| 1 | 50 | Faraday St-Lygon St (West) | Lyg309_T | 2017-11-30T00:00:00.000 | A | NaN | -37.798082 | 144.967210 | South | North | -37.798082 | 144.967210 |
| 2 | 73 | Bourke St - Spencer St (South) | Bou655_T | 2020-10-02T00:00:00.000 | I | NaN | -37.816957 | 144.954154 | East | West | -37.816957 | 144.954154 |
| 3 | 66 | State Library - New | QVN_T | 2020-04-06T00:00:00.000 | A | NaN | -37.810578 | 144.964443 | South | North | -37.810578 | 144.964443 |
| 4 | 59 | Building 80 RMIT | RMIT_T | 2019-02-13T00:00:00.000 | A | NaN | -37.808256 | 144.963049 | South | North | -37.808256 | 144.963049 |
#Pedestrian foot count data
#Uncomment the lines below to download the sensor data from the open data source and store it locally as a CSV.
sensor_data_id = "b2ak-trbp"
# results = client.get(sensor_data_id, limit=7000000)
# ds_traffic = pd.DataFrame.from_records(results)
# ds_traffic.to_csv('sensor_traffic.csv', index=False)
sensor_traffic = pd.read_csv('sensor_traffic.csv')
sensor_traffic.info()
print(f'\nThe shape of dataset is {sensor_traffic.shape}. \n')
print('Below are the first few rows of this dataset: ')
sensor_traffic.head(10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4463670 entries, 0 to 4463669
Data columns (total 10 columns):
 #   Column         Dtype
---  ------         -----
 0   id             int64
 1   date_time      object
 2   year           int64
 3   month          object
 4   mdate          int64
 5   day            object
 6   time           int64
 7   sensor_id      int64
 8   sensor_name    object
 9   hourly_counts  int64
dtypes: int64(6), object(4)
memory usage: 340.6+ MB

The shape of dataset is (4463670, 10).

Below are the first few rows of this dataset:
| | id | date_time | year | month | mdate | day | time | sensor_id | sensor_name | hourly_counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2887628 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 34 | Flinders St-Spark La | 300 |
| 1 | 2887629 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 39 | Alfred Place | 604 |
| 2 | 2887630 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 37 | Lygon St (East) | 216 |
| 3 | 2887631 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 40 | Lonsdale St-Spring St (West) | 627 |
| 4 | 2887632 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 36 | Queen St (West) | 774 |
| 5 | 2887633 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 29 | St Kilda Rd-Alexandra Gardens | 644 |
| 6 | 2887634 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 42 | Grattan St-Swanston St (West) | 453 |
| 7 | 2887635 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 43 | Monash Rd-Swanston St (West) | 387 |
| 8 | 2887636 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 44 | Tin Alley-Swanston St (West) | 27 |
| 9 | 2887637 | 2019-11-01T17:00:00.000 | 2019 | November | 1 | Friday | 17 | 35 | Southbank | 2691 |
#Add date column
sensor_traffic['date'] = pd.to_datetime(sensor_traffic.date_time).dt.date
#Add day of week column
sensor_traffic['dow'] = pd.to_datetime(sensor_traffic.date_time).dt.day_of_week
#convert sensor_id to integer
sensor_traffic['sensor_id']=sensor_traffic['sensor_id'].astype(int)
sensor_location['sensor_id']=sensor_location['sensor_id'].astype(int)
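As a quick sanity check on the new `dow` column (toy dates, not the sensor data): pandas numbers days of the week from Monday = 0 to Sunday = 6.

```python
import pandas as pd

#2019-11-01 was a Friday (4) and 2019-11-03 a Sunday (6)
s = pd.to_datetime(pd.Series(['2019-11-01T17:00:00', '2019-11-03T09:00:00']))
print(s.dt.day_of_week.tolist())  # [4, 6]
```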
# Merge the pedestrian sensor location and foot traffic datasets on sensor_id
sensor_ds = pd.merge(sensor_traffic, sensor_location, on='sensor_id')
#Simply use the year to label all traffic before 2020 as pre-Covid, and all traffic from 2020 onwards as post-2019 (Covid era)
sensor_ds['pre2020_hourly_counts'] = np.where(sensor_ds['year']<2020,sensor_ds['hourly_counts'] , 0)
sensor_ds['post2019_hourly_counts'] = np.where(sensor_ds['year']>2019,sensor_ds['hourly_counts'] , 0)
#Add columns for day (5am to 5pm) and night (6pm to 4am) traffic
sensor_ds['day_counts'] = np.where(((sensor_ds['time']>4) & (sensor_ds['time']<18)),sensor_ds['hourly_counts'] , 0)
sensor_ds['night_counts'] = np.where(sensor_ds['day_counts']==0,sensor_ds['hourly_counts'], 0)
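To make the two masks above concrete, here is a toy check (hypothetical counts, not the sensor data) that every hour falls into exactly one bucket: 13 day hours (5am to 5pm) and 11 night hours (6pm to 4am).

```python
import numpy as np
import pandas as pd

#one row per hour of the day, all with the same hypothetical count
toy = pd.DataFrame({'time': range(24), 'hourly_counts': [100] * 24})
toy['day_counts'] = np.where((toy['time'] > 4) & (toy['time'] < 18), toy['hourly_counts'], 0)
toy['night_counts'] = np.where(toy['day_counts'] == 0, toy['hourly_counts'], 0)

print((toy['day_counts'] > 0).sum(), (toy['night_counts'] > 0).sum())  # 13 11
```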
sensor_ds.describe()
| | id | year | mdate | time | sensor_id | hourly_counts | dow | latitude | longitude | lat | lon | pre2020_hourly_counts | post2019_hourly_counts | day_counts | night_counts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 | 4.463670e+06 |
| mean | 2.232320e+06 | 2.017230e+03 | 1.574741e+01 | 1.146917e+01 | 2.700226e+01 | 4.951352e+02 | 2.999995e+00 | -3.781345e+01 | 1.449621e+02 | -3.781345e+01 | 1.449621e+02 | 4.074119e+02 | 8.772332e+01 | 3.594659e+02 | 1.356693e+02 |
| std | 1.289316e+06 | 3.551188e+00 | 8.800020e+00 | 6.938594e+00 | 1.951539e+01 | 7.509822e+02 | 2.000015e+00 | 6.187582e-03 | 8.531568e-03 | 6.187582e-03 | 8.531568e-03 | 7.443612e+02 | 2.852713e+02 | 7.131327e+02 | 3.910920e+02 |
| min | 1.000000e+00 | 2.009000e+03 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | -3.782402e+01 | 1.449297e+02 | -3.782402e+01 | 1.449297e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 1.115918e+06 | 2.015000e+03 | 8.000000e+00 | 5.000000e+00 | 1.100000e+01 | 4.300000e+01 | 1.000000e+00 | -3.781874e+01 | 1.449587e+02 | -3.781874e+01 | 1.449587e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 2.231836e+06 | 2.018000e+03 | 1.600000e+01 | 1.100000e+01 | 2.300000e+01 | 1.780000e+02 | 3.000000e+00 | -3.781381e+01 | 1.449651e+02 | -3.781381e+01 | 1.449651e+02 | 6.200000e+01 | 0.000000e+00 | 2.400000e+01 | 0.000000e+00 |
| 75% | 3.347753e+06 | 2.020000e+03 | 2.300000e+01 | 1.700000e+01 | 4.100000e+01 | 6.050000e+02 | 5.000000e+00 | -3.781102e+01 | 1.449669e+02 | -3.781102e+01 | 1.449669e+02 | 4.530000e+02 | 2.400000e+01 | 3.520000e+02 | 5.900000e+01 |
| max | 4.469141e+06 | 2.022000e+03 | 3.100000e+01 | 2.300000e+01 | 8.700000e+01 | 1.597900e+04 | 6.000000e+00 | -3.779432e+01 | 1.449747e+02 | -3.779432e+01 | 1.449747e+02 | 1.597900e+04 | 1.443700e+04 | 1.161200e+04 | 1.597900e+04 |
flag_value=0
df_day=sensor_ds.query("day_counts > @flag_value")
print('Day info\n', df_day.info(),'\n')
df_night=sensor_ds.query("day_counts == @flag_value")
print('Night info \n',df_night.info(),'\n')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2402520 entries, 0 to 4463663
Data columns (total 27 columns):
 #   Column                  Dtype
---  ------                  -----
 0   id                      int64
 1   date_time               object
 2   year                    int64
 3   month                   object
 4   mdate                   int64
 5   day                     object
 6   time                    int64
 7   sensor_id               int32
 8   sensor_name_x           object
 9   hourly_counts           int64
 10  date                    object
 11  dow                     int64
 12  sensor_description      object
 13  sensor_name_y           object
 14  installation_date       object
 15  status                  object
 16  note                    object
 17  latitude                float64
 18  longitude               float64
 19  direction_1             object
 20  direction_2             object
 21  lat                     float64
 22  lon                     float64
 23  pre2020_hourly_counts   int64
 24  post2019_hourly_counts  int64
 25  day_counts              int64
 26  night_counts            int64
dtypes: float64(4), int32(1), int64(10), object(12)
memory usage: 504.1+ MB
Day info
 None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2061150 entries, 1 to 4463669
Data columns (total 27 columns):
 #   Column                  Dtype
---  ------                  -----
 0   id                      int64
 1   date_time               object
 2   year                    int64
 3   month                   object
 4   mdate                   int64
 5   day                     object
 6   time                    int64
 7   sensor_id               int32
 8   sensor_name_x           object
 9   hourly_counts           int64
 10  date                    object
 11  dow                     int64
 12  sensor_description      object
 13  sensor_name_y           object
 14  installation_date       object
 15  status                  object
 16  note                    object
 17  latitude                float64
 18  longitude               float64
 19  direction_1             object
 20  direction_2             object
 21  lat                     float64
 22  lon                     float64
 23  pre2020_hourly_counts   int64
 24  post2019_hourly_counts  int64
 25  day_counts              int64
 26  night_counts            int64
dtypes: float64(4), int32(1), int64(10), object(12)
memory usage: 432.4+ MB
Night info
 None
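The `@flag_value` references above are pandas `query()` syntax for reading a local Python variable; a minimal sketch on toy data (hypothetical counts):

```python
import pandas as pd

flag_value = 0
toy = pd.DataFrame({'day_counts': [0, 5, 0, 12]})
#'@flag_value' pulls the local variable into the query expression
toy_day = toy.query("day_counts > @flag_value")
toy_night = toy.query("day_counts == @flag_value")
print(len(toy_day), len(toy_night))  # 2 2
```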
Separate day and night, and before and after Covid for mapping
#split the day dataset to map the difference between the periods before and after Covid-19
df = df_day
#print(df_day.head(2))
df_daypft_beforecovid = df.loc[df['year']<2020]
df_daypft_aftercovid = df.loc[df['year']>2019]
#get average hourly count for each sensor during the selected period of time
df_daypft_beforecovid_avg = df_daypft_beforecovid[['sensor_id','sensor_description','lat','lon','hourly_counts']]
df_daypft_beforecovid_avg = df_daypft_beforecovid_avg.groupby(['sensor_id','sensor_description','lat','lon',],as_index=False).agg({'hourly_counts': 'mean'})
df_daypft_aftercovid_avg = df_daypft_aftercovid[['sensor_id','sensor_description','lat','lon','hourly_counts']]
df_daypft_aftercovid_avg = df_daypft_aftercovid_avg.groupby(['sensor_id','sensor_description','lat','lon',],as_index=False).agg({'hourly_counts': 'mean'})
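The groupby/agg pattern above collapses the many hourly rows per sensor into a single average row; a toy illustration (hypothetical sensor ids and counts):

```python
import pandas as pd

toy = pd.DataFrame({
    'sensor_id': [1, 1, 2, 2],
    'hourly_counts': [100, 300, 50, 150],
})
#one averaged row per sensor; as_index=False keeps sensor_id as a column
toy_avg = toy.groupby('sensor_id', as_index=False).agg({'hourly_counts': 'mean'})
print(toy_avg['hourly_counts'].tolist())  # [200.0, 100.0]
```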
Examine Pedestrian Traffic
Pedestrian traffic has decreased since 2019; the aim is to understand the patterns of day and night traffic.
#examine pre Covid and post 2019 foot traffic
ds = pd.DataFrame(sensor_ds.groupby(["dow","day"])[["pre2020_hourly_counts","post2019_hourly_counts"]].mean())
df = ds.sort_values(by=['dow'])
axs = df.plot.bar(figsize=(12, 4), color=["teal","orange"])
axs.set_title('Foot Traffic by Day of Week', size=20)
axs.set_ylabel('Average hourly counts', size=14)
plt.show()
#examine pre Covid and post 2019 foot traffic
ds = pd.DataFrame(sensor_ds.groupby(["time"])[["pre2020_hourly_counts","post2019_hourly_counts"]].mean())
df = ds.sort_values(by=['time'])
axs = df.plot.line(figsize=(20, 6), color=["teal","orange"])
axs.set_title('Foot Traffic by Time', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()
Examine whether day and night traffic vary by location
#distribution of traffic by location, by day
pivot = pd.pivot_table(sensor_ds, values='day_counts', index=['sensor_id','sensor_description'], aggfunc='mean')
pivot_ds = pivot['day_counts'].nlargest(n=20)
axs = pivot_ds.plot.bar(figsize=(12, 5), color='orange', legend=True)
axs.set_title('Top 20: Day foot traffic by location', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()
#distribution of traffic by location, by night
pivot = pd.pivot_table(sensor_ds, values='night_counts', index=['sensor_id','sensor_description'], aggfunc='mean')
pivot_ds = pivot['night_counts'].nlargest(n=20)
axs = pivot_ds.plot.bar(figsize=(12, 5), color='navy', legend=True)
axs.set_title('Top 20: Night foot traffic by location', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()
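The pivot/nlargest pattern used in the charts above, shown on toy data (hypothetical sensors and counts): average the counts per sensor, then keep only the busiest locations.

```python
import pandas as pd

toy = pd.DataFrame({
    'sensor_id': [1, 1, 2, 2, 3],
    'night_counts': [10, 30, 100, 200, 5],
})
#average per sensor: sensor 1 -> 20.0, sensor 2 -> 150.0, sensor 3 -> 5.0
pivot = pd.pivot_table(toy, values='night_counts', index='sensor_id', aggfunc='mean')
print(pivot['night_counts'].nlargest(2).tolist())  # [150.0, 20.0]
```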
Day Economy
We can see changes in the traffic at each sensor before and after Covid on the map below. Click the layers icon in the top right corner of the map and select the 'day traffic before covid-19' layer to see traffic before Covid.
#Visualise day data
m = folium.Map(location=[-37.8167, 144.967], zoom_start=15) # tiles='Stamen Toner'
locations = []
for i in range(len(df_daypft_beforecovid_avg)):
    row = df_daypft_beforecovid_avg.iloc[i]
    #repeat each sensor's coordinates by its average hourly count, so cluster sizes reflect traffic volume
    location = [(row.lat, row.lon)] * int(row['hourly_counts'])
    locations += location
marker_cluster = MarkerCluster(
    name='day traffic before covid-19',
    locations=locations,
    overlay=True,
    control=True,
    show=False  #hidden by default - select the layer in the layer control to display these points
)
marker_cluster.add_to(m)
locations = []
for i in range(len(df_daypft_aftercovid_avg)):
    row = df_daypft_aftercovid_avg.iloc[i]
    location = [(row.lat, row.lon)] * int(row['hourly_counts'])
    locations += location
marker_cluster = MarkerCluster(
    name='day traffic after covid-19',
    locations=locations,
    overlay=True,
    control=True,
)
marker_cluster.add_to(m)
folium.LayerControl().add_to(m)
m